Dataset v2.0 #461

Merged
aliberts merged 130 commits into main from user/aliberts/2024_09_25_reshape_dataset on Nov 29, 2024
Conversation

aliberts
Collaborator

@aliberts aliberts commented Oct 3, 2024

What this does

This PR introduces a new format for LeRobotDataset, which is accompanied by a new file structure. As these changes are not backward compatible, we increase CODEBASE_VERSION from v1.6 to v2.0.

What do I need to do?

If you already pushed a dataset using v1.6 of our codebase, you can use the conversion script lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py to convert it to the new format.
You will be asked to enter a prompt describing the task performed in the dataset.

Examples for single-task dataset:

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket."

If you recorded your dataset with one of the manipulator robots currently supported in LeRobot (or your own implementation), you can provide its configuration path to add the motor names and robot type to the dataset info using the --robot-config option:

python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml

For the more complicated cases of one task per episode or multiple tasks per episodes, please refer to the documentation in that script.

Motivation

The current implementation of LeRobotDataset has a few shortcomings that make it hard to use in some respects. Specifically:

  • The file structure does not accurately reflect the data structure. Our datasets are structured by episodes, in contrast with typical ML scenarios built around train/val/test splits (although these concepts can still be relevant here). This makes it hard to select a subset of episodes from a dataset, since the whole dataset has to be downloaded and loaded. Related: #440
  • Due to the hub's current limitations, one cannot push a dataset with more than 10k episodes at most (fewer if there are multiple cameras).
  • The format is not transparent to the user: to get information about the content of a dataset, the only options are to download the entire dataset and inspect it with a custom script, or to try to visualize it using our visualization tool. Related: #383
  • The default file cache system used by datasets and huggingface_hub makes it inconvenient to create datasets locally (with recording). To use the newly created files on disk, these libraries check whether those files are present in the cache (which they won't be) and, if not, download them even though they may already be on disk.
  • Some of the file formats used (e.g. .safetensors) are too framework-specific for this format to be universal.
  • The dataset viewer on the hub is not compatible with our datasets due to VideoFrame not yet being integrated into datasets.
  • The current implementation lacks support for future features that we may want to add such as:
    • Task-tokens-conditioned training
    • Multirobot policies
    • Depth images (Related: #435)

Changes

Some of the biggest changes come from the new file structure and its contents:

  .
  ├── data
- │   ├── train-00000-of-0001.parquet
+ │   ├── chunk-000
+ │   │   ├── episode_000000.parquet
+ │   │   ├── episode_000001.parquet
+ │   │   ├── episode_000002.parquet
+ │   │   └── ...
+ │   ├── chunk-001
+ │   │   ├── episode_001000.parquet
+ │   │   ├── episode_001001.parquet
+ │   │   ├── episode_001002.parquet
+ │   │   └── ...
+ │   └── ...
- ├── meta_data
+ ├── meta
- │   ├── episode_data_index.safetensors
+ │   ├── episodes.jsonl
  │   ├── info.json
+ │   ├── stats.json
- │   ├── stats.safetensors
+ │   └── tasks.jsonl
  └── videos
+     ├── chunk-000
+     │   ├── observation.images.laptop
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     │   ├── observation.images.phone
      │   │   ├── episode_000000.mp4
      │   │   ├── episode_000001.mp4
      │   │   ├── episode_000002.mp4
      │   │   └── ...
+     ├── chunk-001
      └── ...

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes to use and download. The structure of the dataset is entirely described in the info.json file, which can be easily downloaded or viewed directly on the hub before downloading any actual data. The file types used are simple and need no complex tools to be read: only .parquet, .json, .jsonl and .mp4 files (plus .md for the README).
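
As a rough illustration of the chunked layout above, the sketch below derives file paths from an episode index. The chunk size of 1000 is inferred from the tree (chunk-001 starts at episode_001000). The helper names and path templates are assumptions made for illustration; the actual templates are the ones stored in info.json.

```python
from pathlib import PurePosixPath

# Assumed from the tree above: chunk-001 starts at episode_001000.
CHUNK_SIZE = 1000

# Hypothetical templates mirroring the tree; the real ones live in info.json.
DATA_TEMPLATE = "data/chunk-{chunk:03d}/episode_{episode:06d}.parquet"
VIDEO_TEMPLATE = "videos/chunk-{chunk:03d}/{video_key}/episode_{episode:06d}.mp4"


def episode_chunk(episode_index: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Return the index of the chunk directory an episode falls into."""
    return episode_index // chunk_size


def data_path(episode_index: int) -> PurePosixPath:
    """Build the relative parquet path for one episode."""
    chunk = episode_chunk(episode_index)
    return PurePosixPath(DATA_TEMPLATE.format(chunk=chunk, episode=episode_index))


def video_path(episode_index: int, video_key: str) -> PurePosixPath:
    """Build the relative mp4 path for one episode and one camera key."""
    chunk = episode_chunk(episode_index)
    return PurePosixPath(
        VIDEO_TEMPLATE.format(chunk=chunk, video_key=video_key, episode=episode_index)
    )


print(data_path(1001))
print(video_path(2, "observation.images.laptop"))
```

Because each episode maps to its own files, selecting a subset of episodes only requires fetching the paths computed for those indices.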

Added

  • A LeRobotDataset can now be called with an episodes argument (e.g. episodes=[1, 10, 12, 40]) to select a specific subset of episodes by their episode_index. By doing so, only the files corresponding to these episodes will be downloaded (if they're not already on disk). In that case, the hf_dataset attribute and the episode_data_index will only contain data from these episodes.
  • Dataset metadata logic is now handled by the LeRobotDatasetMetadata class. This makes it possible to get information about a dataset before loading the data. For example:
import math

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# Fetch metadata from the hub (no episode data is downloaded at this point)
metadata = LeRobotDatasetMetadata("lerobot/pusht")

# Calculate train and val episodes (90/10 split by episode index)
total_episodes = metadata.total_episodes
episodes = list(range(total_episodes))
num_train_episodes = math.floor(total_episodes * 90 / 100)
train_episodes = episodes[:num_train_episodes]
val_episodes = episodes[num_train_episodes:]

# Load train and val datasets
train_dataset = LeRobotDataset("lerobot/pusht", episodes=train_episodes)
val_dataset = LeRobotDataset("lerobot/pusht", episodes=val_episodes)
  • Tasks as natural language prompts are now in every dataset, and a task is required to create one. Every task of a dataset is listed in tasks.jsonl, mapped to its task_index, which is what's actually stored in the parquet files. Using the API, tasks can be accessed either with dataset.meta.tasks to get that mapping or through dataset.episode_dict[episode_index]["tasks"] if you're only interested in a particular episode.
  • Various pieces of information about the structure of the dataset have been added and are now centralized in info.json (keys, shapes, number of episodes, etc.). It serves as a source of truth for what's inside the dataset.
  • episodes.jsonl contains per-episode information (episode_index, tasks in natural language and episode lengths). This is accessed through the episode_dict attribute in the API.
  • LeRobotDataset.create() makes it possible to create a new dataset from scratch, either for recording data or for porting an existing dataset to the LeRobotDataset format. To that end, new methods are added:
    • start_image_writer(): This instantiates an ImageWriter in the image_writer attribute to write images asynchronously during data recording. This is automatically called during LeRobotDataset.create() if specified in the arguments.
    • stop_image_writer(): This properly stops the ImageWriter and removes it from the dataset's attributes. Importantly: if the image_writer has been set to a multiprocess ImageWriter, this needs to be called first if you want to pass this dataset into a parallelized DataLoader, as the ImageWriter class is not picklable (which is required for objects to be transferred between processes). This is not needed when instantiating a dataset with __init__, as the image_writer is not created in that case.
    • add_frame(): Adds a single frame of data (one timestamp) to the episode_buffer, which keeps data in memory temporarily. Note: this will be merged with the DataBuffer from #445 in a subsequent PR.
    • add_episode(): Saves the content of the episode_buffer to disk and updates the metadata so that it stays in sync with the contents of the files. This method expects a task argument as a string prompt in natural language describing the task performed in the episode. Videos from that episode can optionally be encoded during this phase, but this is not mandatory and can be done later to give more flexibility on when to do it.
    • consolidate(): This will encode videos that have not yet been encoded, clean up the temporary image files, compute dataset statistics, check that timestamps are in sync with the fps and perform additional sanity checks on the dataset. It needs to be done before uploading the dataset to the hub with push_to_hub().
    • clear_episode_buffer(): This can be used to reset the episode_buffer (e.g. to discard data from a current recording).
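
To make the role of tasks.jsonl and episodes.jsonl more concrete, here is a minimal sketch that parses two small JSON Lines documents with the standard library only. The field names ("task_index", "task", "episode_index", "tasks", "length") are assumed from the descriptions above, and the sample rows (including the lengths) are made up for illustration; this is not the loader used in the codebase.

```python
import json

# Hypothetical file contents; field names are assumed, values are made up.
TASKS_JSONL = """\
{"task_index": 0, "task": "Insert the peg into the socket."}
{"task_index": 1, "task": "Pick the Lego block and drop it in the box on the right."}
"""

EPISODES_JSONL = """\
{"episode_index": 0, "tasks": ["Insert the peg into the socket."], "length": 500}
{"episode_index": 1, "tasks": ["Pick the Lego block and drop it in the box on the right."], "length": 420}
"""


def read_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# task_index -> natural language prompt (the parquet files only store task_index)
tasks = {row["task_index"]: row["task"] for row in read_jsonl(TASKS_JSONL)}

# episode_index -> per-episode record (tasks and length)
episodes = {row["episode_index"]: row for row in read_jsonl(EPISODES_JSONL)}

print(tasks[1])
print(episodes[0]["tasks"], episodes[0]["length"])
```

Since both files are plain text with one record per line, they can be inspected directly on the hub or appended to while recording.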

Changed

  • The logic for checking timestamps and delta_timestamps sync has been moved out of __getitem__() and is now run during __init__ or consolidate(). This both saves computation in __getitem__() and surfaces timestamp sync issues immediately.
  • The paths for the parquet and video files are now embedded in info.json, which adds flexibility and makes it easy to split files into chunks across directories to stay under the hub's limit of 10k files per folder.
  • We now store every dataset (created or downloaded) in ~/.cache/huggingface/lerobot by default. Changing root or setting the LEROBOT_HOME env variable changes that location. Every call to huggingface_hub download functions such as snapshot_download or hf_hub_download passes that location as the local_dir argument, so that files are not duplicated in the cache and files already present on disk are not downloaded again.
  • Refactored the image writing code from populate_dataset.py into an ImageWriter class.
  • stats.safetensors is now stats.json (the content remains the same but it's unflattened).
  • episode_data_index.safetensors is removed, but episode_data_index is still available in the API to map episode_index to indices.
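
The kind of up-front check described in the first item of this list can be sketched as follows. This is a simplified illustration for a single episode, not the implementation in this PR: the function names, the tolerance value and the error messages are all assumptions.

```python
def check_timestamps_sync(timestamps: list[float], fps: int, tolerance_s: float = 1e-4) -> None:
    """Raise ValueError if consecutive timestamps are not spaced by 1/fps.

    Assumes all timestamps belong to a single episode.
    """
    expected = 1.0 / fps
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if abs(gap - expected) > tolerance_s:
            raise ValueError(
                f"Timestamps {i - 1} and {i} are {gap:.6f}s apart, expected {expected:.6f}s"
            )


def check_delta_timestamps(
    delta_timestamps: dict[str, list[float]], fps: int, tolerance_s: float = 1e-4
) -> None:
    """Raise ValueError if any requested delta is not a multiple of 1/fps."""
    for key, deltas in delta_timestamps.items():
        for delta in deltas:
            frames = delta * fps  # delta expressed as a number of frames
            if abs(frames - round(frames)) / fps > tolerance_s:
                raise ValueError(f"{key}: delta {delta} is not a multiple of 1/{fps}")


# Both checks pass silently for data recorded at a steady 50 fps.
check_timestamps_sync([0.0, 0.02, 0.04, 0.06], fps=50)
check_delta_timestamps({"observation.state": [-1, -1 / 50, 0, 25 / 50]}, fps=50)
```

Running such checks once at construction time means a misaligned dataset fails immediately rather than at some arbitrary index during training.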

Performance

In the nominal case (no delta_timestamps), LeRobotDataset.__getitem__() is on par with the previous version: sometimes slightly faster, but generally in the same ballpark.

__getitem__() call time in seconds (average on 10k iterations):

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0036 | 0.0037
lerobot/aloha_sim_insertion_human       | 0.0029 | 0.0027
lerobot/pusht_image                     | 0.0003 | 0.0003
lerobot/pusht                           | 0.0011 | 0.0009
aliberts/koch_tutorial                  | 0.0111 | 0.0106
lerobot/aloha_mobile_cabinet            | 0.0104 | 0.0101
Benchmarking code
from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average on {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")

With delta_timestamps, results vary more from one dataset to another but remain in the same ballpark.
__getitem__() call time in seconds (average on 10k iterations), delta_timestamps=[-1/fps, 0, 1/fps]:

repo_id                                 | v1.6   | v2.0  
--------------------------------------- | ------ | ------
lerobot/aloha_sim_insertion_human_image | 0.0176 | 0.0160
lerobot/aloha_sim_insertion_human       | 0.0073 | 0.0068
lerobot/pusht_image                     | 0.0024 | 0.0032
lerobot/pusht                           | 0.0028 | 0.0043
aliberts/koch_tutorial                  | 0.0200 | 0.0184
lerobot/aloha_mobile_cabinet            | 0.0224 | 0.0181
Benchmarking code (delta_timestamps)
from pathlib import Path
import time
import torch
from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset

repo_ids = [
    "lerobot/aloha_sim_insertion_human_image",
    "lerobot/aloha_sim_insertion_human",
    "lerobot/pusht_image",
    "lerobot/pusht",
    "aliberts/koch_tutorial",
    "lerobot/aloha_mobile_cabinet",
]
num_iterations = 10000
logfile = Path(f"perf_log_{CODEBASE_VERSION}_{num_iterations}.txt")
with open(logfile, "a") as file:
    file.write(f"__getitem__() call time in seconds (average on {num_iterations} iterations)\n\n")
    file.write(f"repo_id                                 | {CODEBASE_VERSION}  \n")
    file.write("--------------------------------------- | ------\n")

for repo_id in repo_ids:
    dataset = LeRobotDataset(repo_id=repo_id)
    fps = dataset.fps
    keys = ["observation.state", *dataset.camera_keys]
    delta_timestamps = {key: [-1/fps, 0, 1/fps] for key in keys}
    dataset = LeRobotDataset(repo_id=repo_id, delta_timestamps=delta_timestamps)
    durations = []
    for i in range(num_iterations):
        start = time.perf_counter()
        item = dataset[i]
        duration = time.perf_counter() - start
        durations.append(duration)

    del dataset
    avg_duration = torch.Tensor(durations).mean()
    results = f"{repo_id} | {avg_duration:.4f}s"
    print(results)
    with open(logfile, "a") as file:
        file.write(results + "\n")
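
To read the delta_timestamps table above in relative terms, the short sketch below computes the percent change from v1.6 to v2.0 for each row. The timings are copied from that table; the helper name is illustrative.

```python
# (v1.6, v2.0) average __getitem__() times in seconds, copied from the table above.
TIMINGS = {
    "lerobot/aloha_sim_insertion_human_image": (0.0176, 0.0160),
    "lerobot/aloha_sim_insertion_human": (0.0073, 0.0068),
    "lerobot/pusht_image": (0.0024, 0.0032),
    "lerobot/pusht": (0.0028, 0.0043),
    "aliberts/koch_tutorial": (0.0200, 0.0184),
    "lerobot/aloha_mobile_cabinet": (0.0224, 0.0181),
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


changes = {repo_id: percent_change(old, new) for repo_id, (old, new) in TIMINGS.items()}

for repo_id, change in changes.items():
    print(f"{repo_id:<40} {change:+.1f}%")
```

A negative value means v2.0 is faster for that dataset, a positive value means it is slower.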

Fixes

  • Fix a bug in load_previous_and_future_frames, which did not actually raise an error when the timestamps requested through delta_timestamps did not correspond to actual timestamps in the dataset.
  • Various fixes on the datasets have been made:
    • Some tasks already present in some datasets contained strings which were not part of the task (e.g. "tf.Tensor(b'Do something', shape=(), dtype=string)")
    • Some video files were not properly tracked by git lfs
    • Some datasets have a mismatch between the number of episodes in their parquet files and the number of video files. This is being investigated [TODO]
      • lerobot/aloha_mobile_shrimp
      • lerobot/aloha_static_battery
      • lerobot/aloha_static_fork_pick_up
      • lerobot/aloha_static_thread_velcro
      • lerobot/uiuc_d3field
    • lerobot/viola is missing video keys [TODO]
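
For the malformed task strings mentioned in the list above, a cleanup along the following lines would recover the actual prompt. This is only a sketch based on the single example given; the regular expression and the function name are assumptions, not the code used to fix the datasets.

```python
import re

# Matches the TensorFlow string-tensor repr quoted above and captures its payload.
_TF_STRING_REPR = re.compile(
    r"""^tf\.Tensor\(b(['"])(?P<text>.*)\1, shape=\(\), dtype=string\)$"""
)


def clean_task(task: str) -> str:
    """Strip a tf.Tensor(...) wrapper from a task string, if present."""
    match = _TF_STRING_REPR.match(task.strip())
    return match.group("text") if match else task


print(clean_task("tf.Tensor(b'Do something', shape=(), dtype=string)"))
print(clean_task("Do something"))
```

Strings that do not match the wrapper pattern are returned unchanged, so the function is safe to apply to every task in a dataset.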

How it was tested

  • Adds tests/fixtures/, in which fixtures and fixture factories have been added to simplify writing and adding tests. These factories make it possible to create partially mocked objects on the fly for use in tests, without relying on other components of the codebase that are not meant to be tested in a particular test (e.g. initializing a dataset using hydra).
  • Adds tests/test_image_writer.py
  • Adds tests/test_delta_timestamps.py
  • Deactivates a bunch of tests which will need to be redesigned and simplified in further PRs.

How to check out & try? (for the reviewer)

Use an existing dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "lerobot/aloha_sim_insertion_human"  # try with '_image' as well

delta_timestamps = {
    "observation.images.top": [-1, -1/50, 0, 25/50],
    "observation.state": [-1, -1/50, 0, 25/50],
}
dataset = LeRobotDataset(repo_id=REPO_ID, delta_timestamps=delta_timestamps)

Try out the new feature to select / download specific episodes:

dataset = LeRobotDataset(repo_id=REPO_ID, episodes=[1, 10, 12, 40])

You can also create a new dataset:

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

REPO_ID = "your_hf_username/test_v2"

new_dataset = LeRobotDataset.create(
    repo_id=REPO_ID,
    fps=30,
    robot=robot,
    image_writer_threads_per_camera=1,
)

# TODO
frame = {
    ...
}
new_dataset.add_frame(frame)
new_dataset.add_episode(task="Do something")
new_dataset.consolidate()

@aliberts aliberts added ✨ Enhancement New feature or request 🗃️ Dataset Something dataset-related labels Oct 3, 2024
@aliberts aliberts self-assigned this Oct 3, 2024
@Cadene Cadene self-requested a review October 11, 2024 15:10
@Cadene
Collaborator

Cadene commented Nov 22, 2024

TODO after merging: #485

Collaborator

@Cadene Cadene left a comment


Beautiful work thanks. Left some comments. Hope it helps :)

tests/test_datasets.py
@@ -297,6 +289,7 @@ def test_flatten_unflatten_dict():
assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"


@pytest.mark.skip("TODO after v2 migration / removing hydra")

This test, test_backward_compatibility(repo_id), makes me think we should probably train a diffusion policy on pusht before and after this PR to compare dataset v1 vs v2.

@astroyat

I tried training using the new dataset and see some errors in compute_stats.py. Should d.stats be changed to d.meta.stats?

@aliberts aliberts merged commit 32eb0ce into main Nov 29, 2024
6 of 7 checks passed
@aliberts aliberts deleted the user/aliberts/2024_09_25_reshape_dataset branch November 29, 2024 18:04
DomThePorcupine pushed a commit to DomThePorcupine/lerobot that referenced this pull request Dec 2, 2024
Labels
🗃️ Dataset Something dataset-related ✨ Enhancement New feature or request
Projects
Status: Done
5 participants